Objectives

Please add appropriate graph titles and axis labels yourself.

Demo

You will need to install the car, leaps, GGally, and openintro packages if you have not already done so.
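
As a convenience (this snippet is not part of the original demo), the following checks for and installs any missing packages used below:

```r
# Install any packages used in this demo that are not yet installed
pkgs <- c("openintro", "GGally", "ggplot2", "car", "leaps")
missing <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```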

Example 1

We will use a subset of the bdims dataset in the openintro package. The subset that we will use is called d (see below).

library(openintro)
dim(bdims)
[1] 507  25
names(bdims)
 [1] "bia.di" "bii.di" "bit.di" "che.de" "che.di" "elb.di"
 [7] "wri.di" "kne.di" "ank.di" "sho.gi" "che.gi" "wai.gi"
[13] "nav.gi" "hip.gi" "thi.gi" "bic.gi" "for.gi" "kne.gi"
[19] "cal.gi" "ank.gi" "wri.gi" "age"    "wgt"    "hgt"   
[25] "sex"   

# Creating dataset for the demo
d = bdims[,c("wgt", "sex", "age", "sho.gi", "che.gi", 
"thi.gi", "bic.gi", "for.gi", "kne.gi")]

# See descriptions of the variables of interest
?bdims

str(d)
'data.frame':   507 obs. of  9 variables:
 $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
 $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ sho.gi: num  106 110 115 104 108 ...
 $ che.gi: num  89.5 97 97.5 97 97.5 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
 $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
 $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
 $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...

# sex should be a categorical variable, but it is
# currently stored as an integer; let's convert it
# to a factor
d$sex = as.factor(d$sex)

str(d)
'data.frame':   507 obs. of  9 variables:
 $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
 $ sex   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ sho.gi: num  106 110 115 104 108 ...
 $ che.gi: num  89.5 97 97.5 97 97.5 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
 $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
 $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
 $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...
# The data types of other variables seem correct.
# If needed, we can always modify the data types later.

We want to build a linear model with all, or a subset, of the variables in our dataset to predict the weight of a person.

  1. Use the ggpairs() function in the GGally package to graph the pairwise relationships between the variables, making separate plots for males and females. Why do we want separate plots for males and females?

d$sex = as.factor(d$sex)
library(ggplot2)  
library(GGally)
# Instructors, please explain what the graph and the numbers 
# on the graph mean
ggpairs(d[, c(2:9, 1)], aes(colour = sex, alpha = 0.4), 
         upper = list(continuous = wrap("cor", size = 2.2)),
         lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) + 
  theme(axis.text.x = element_text(angle = 45))

Answer the following questions with the scatterplot matrix above.

  1. Should sex be included as one of the variables in the model?

Yes, since the variable wgt appears to have a different distribution for the two levels of sex.

  1. Should all of the variables che.gi, sho.gi, bic.gi, and for.gi be included in our model as predictors? Explain why or why not.

Most likely not; they have strong pairwise correlations (all around or above 0.9). Exact collinearity between the variables would make the model unidentifiable, and near-collinearity this strong makes the coefficient estimates very unstable. (Preceptors: students do not understand what collinearity means; please explain in simple terms, e.g., with perfectly collinear columns there are infinitely many possible combinations of values for the estimated betas, which makes interpreting the betas infeasible.)
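
To make this concrete, here is a small simulation (not part of the original demo; all variable names are made up for illustration) showing how a nearly collinear predictor inflates the standard error of a coefficient:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly a copy of x1
y  <- 2 * x1 + rnorm(n)

se_alone    <- summary(lm(y ~ x1))$coefficients["x1", "Std. Error"]
se_together <- summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]

cor(x1, x2)             # very close to 1
se_together / se_alone  # the SE of beta_hat for x1 blows up
```

Even though x2 carries essentially no new information, including it makes the standard error of the x1 coefficient many times larger.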

  1. Do you expect age to be a better predictor than the other continuous variables in the dataset?

Probably not, since the scatterplot of weight vs. age does not look very linear. The correlation between the two variables is also low (which is expected from the scatterplot).

  1. Use the regsubsets() function in the leaps package to perform model selection with the BIC, adjusted R-squared, and Mallows' Cp criteria, searching through all possible subsets of the predictors. (Preceptor: Please explain what regsubsets() does.)
library(leaps)
# library(car)

str(d)
'data.frame':   507 obs. of  9 variables:
 $ wgt   : num  65.6 71.8 80.7 72.6 78.8 74.8 86.4 78.4 62 81.6 ...
 $ sex   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ sho.gi: num  106 110 115 104 108 ...
 $ che.gi: num  89.5 97 97.5 97 97.5 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...
 $ bic.gi: num  32.5 34.4 33.4 31 32 33 42.4 34.1 33 36.5 ...
 $ for.gi: num  26 28 28.8 26.2 28.4 28 32.3 28 26 29.2 ...
 $ kne.gi: num  34.5 36.5 37 37 37.7 36.6 40.1 39.2 35.5 38.3 ...

# For each given number of predictors, find the best model
g <- regsubsets(wgt~., data=d, really.big=F, method='exhaustive', nvmax=8)


# Set really.big=T if your design matrix is very big;
# you can also change method to forward selection, backward selection,
# or sequential replacement when an exhaustive search is too slow.
plot(summary(g)$bic, 
     main = 'BIC for best model for each given number of predictors',
     xlab = 'Given number of predictors', 
     ylab = 'BIC')


plot(summary(g)$adjr2, main = 'Adjusted R-square for best model \nfor each given number of predictors', xlab = 'Given number of predictors', ylab = 'Adjusted R-square') # adjusted R^2 chooses a different model


plot(summary(g)$cp, main = 'Mallows Cp for best model for each given number of predictors', xlab = 'Given number of predictors', ylab = 'Mallows Cp')
abline(a = 1, b = 1, lty = "dashed")
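
For reference, with \(n\) observations and \(p\) predictors (plus an intercept), the three criteria plotted above can be written (up to additive constants) as

\[ BIC = n\log(RSS/n) + (p+1)\log n, \qquad R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}, \qquad C_p = \frac{RSS_p}{\hat{\sigma}^2} - n + 2(p+1) \]

where \(\hat{\sigma}^2\) is estimated from the full model. A model with little bias has \(C_p \approx p+1\), which is why the dashed line with intercept 1 and slope 1 is drawn on the Mallows' Cp plot.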

Check what variables are included in each “best” model.

summary.g <- summary(g)

# The best model for each number of predictors included in model
as.data.frame(summary.g$outmat)

Another way to find the best model with 4 predictors and the best model with 5 predictors:

coef(g, id = 4)
(Intercept)        sex1      che.gi      thi.gi      kne.gi 
-83.6121786   5.1971057   0.6746304   0.6395871   1.4059045 

coef(g, id = 5)
(Intercept)        sex1      sho.gi      che.gi      thi.gi 
-87.4847777   4.2805446   0.1742738   0.5549794   0.6234783 
     kne.gi 
  1.3381446 

It looks like all the procedures that we used point us toward two possible models (in R’s notation):

Model 1: wgt ~ sex + che.gi + thi.gi + kne.gi

Model 2: wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi

mod1 = lm(wgt ~ sex + che.gi + thi.gi + kne.gi, data=d)
summary(mod1)

Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.345  -2.515  -0.218   2.201  12.138 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -83.61218    2.61177 -32.014  < 2e-16 ***
sex1          5.19711    0.61496   8.451 3.14e-16 ***
che.gi        0.67463    0.03247  20.774  < 2e-16 ***
thi.gi        0.63959    0.05904  10.834  < 2e-16 ***
kne.gi        1.40590    0.09969  14.103  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.728 on 502 degrees of freedom
Multiple R-squared:  0.9226,    Adjusted R-squared:  0.922 
F-statistic:  1496 on 4 and 502 DF,  p-value: < 2.2e-16

mod2 = lm(wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi, data=d)
summary(mod2)

Call:
lm(formula = wgt ~ sex + sho.gi + che.gi + thi.gi + kne.gi, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.9278  -2.5046  -0.2666   2.0874  12.2488 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -87.48478    2.78306 -31.435  < 2e-16 ***
sex1          4.28054    0.65577   6.527 1.64e-10 ***
sho.gi        0.17427    0.04704   3.705 0.000235 ***
che.gi        0.55498    0.04552  12.193  < 2e-16 ***
thi.gi        0.62348    0.05846  10.664  < 2e-16 ***
kne.gi        1.33814    0.10013  13.364  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.682 on 501 degrees of freedom
Multiple R-squared:  0.9247,    Adjusted R-squared:  0.9239 
F-statistic:  1230 on 5 and 501 DF,  p-value: < 2.2e-16

The function vcov() gives the pairwise covariance between the estimates of two coefficients; e.g., the 3rd row and 4th column of vcov(mod2) gives the covariance of \(\hat{\beta}_{sho.gi}\) and \(\hat{\beta}_{che.gi}\). The numbers on the diagonal of vcov(mod2) give the variances of the \(\beta\) estimates; e.g., the 3rd row and 3rd column of vcov(mod2) gives us the variance of \(\hat{\beta}_{sho.gi}\).

round(vcov(mod1), 4)
            (Intercept)    sex1  che.gi  thi.gi  kne.gi
(Intercept)      6.8213  0.5443 -0.0264 -0.0112 -0.1093
sex1             0.5443  0.3782 -0.0153  0.0208 -0.0135
che.gi          -0.0264 -0.0153  0.0011 -0.0008 -0.0005
thi.gi          -0.0112  0.0208 -0.0008  0.0035 -0.0034
kne.gi          -0.1093 -0.0135 -0.0005 -0.0034  0.0099

round(vcov(mod2), 4) 
            (Intercept)    sex1  sho.gi  che.gi  thi.gi
(Intercept)      7.7454  0.7895 -0.0492  0.0080 -0.0064
sex1             0.7895  0.4300 -0.0116 -0.0069  0.0214
sho.gi          -0.0492 -0.0116  0.0022 -0.0015 -0.0002
che.gi           0.0080 -0.0069 -0.0015  0.0021 -0.0006
thi.gi          -0.0064  0.0214 -0.0002 -0.0006  0.0034
kne.gi          -0.0875 -0.0087 -0.0009  0.0001 -0.0032
             kne.gi
(Intercept) -0.0875
sex1        -0.0087
sho.gi      -0.0009
che.gi       0.0001
thi.gi      -0.0032
kne.gi       0.0100

Note that the correlation of two variables X and Y is defined as

\[ cor(X, Y) = \frac{cov(X,Y)}{sd(X)sd(Y)} \]

We already saw in part (c) that sho.gi and che.gi are highly positively correlated; as a result, \(\hat{\beta}_{sho.gi}\) and \(\hat{\beta}_{che.gi}\) are highly negatively correlated.

# A correlation between beta_hat_i and beta_hat_j close to 1 or -1 is bad,
# since it indicates collinearity between the columns X_i and X_j;
# we do not want to see large values for these correlations.

# The correlation between beta_hat.sho.gi and beta_hat.che.gi is
vcov(mod2)["sho.gi", "che.gi"]/sqrt(vcov(mod2)["sho.gi", "sho.gi"])/sqrt(vcov(mod2)["che.gi", "che.gi"])
[1] -0.7095983

# this is high enough to get our attention
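
Base R also provides cov2cor(), which converts a whole covariance matrix into a correlation matrix in one step; applying cov2cor(vcov(mod2)) would give all pairwise correlations of the coefficient estimates at once. As a self-contained check, here is the same computation on a small hand-built covariance matrix:

```r
# A 2x2 covariance matrix with variances 4 and 9 and covariance 3
V <- matrix(c(4, 3,
              3, 9), nrow = 2, byrow = TRUE,
            dimnames = list(c("a", "b"), c("a", "b")))

cov2cor(V)            # off-diagonal correlation = 3 / (2 * 3) = 0.5
cov2cor(V)["a", "b"]  # 0.5
```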

Also, since including the extra term sho.gi does not increase adjusted R-squared much, we favor model 1.

  1. Investigate the residual plots for model 1.
plot(mod1, which=1:2)

# which = 1 gives the residuals vs. fitted values scatterplot
# which = 2 gives the normal Q-Q plot of the residuals

hist(mod1$residuals, freq=F, breaks=30)

The residuals appear to have a distribution that is close to Normal, which is a good sign. However, the residuals show a slight upward-curving (parabola-like) pattern. Any non-random pattern in the residuals vs. fitted values plot is undesirable; this pattern is weak, though, so it is not too worrying. We will try to improve on this in part (h).

  1. Now look at the pairs plot again: should we include more terms? Say we want to try an interaction term; what kind of interaction term should we add?
 ggpairs(d[,c('sex', 'che.gi', 'thi.gi','kne.gi', 'wgt')], 
         upper = list(continuous = wrap("cor", size = 3)),
         lower = list(continuous = wrap("points", alpha = 0.3, size=0.1)),
         aes(colour = sex, alpha = 0.4))

None of the three x-variables appear to have very different slopes for the two sex groups in the scatterplots of wgt vs. the x-variables. Judging from the scatterplots, thi.gi seems the most likely to have different slopes for females and males, so we will try adding the interaction term thi.gi:sex.
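
In R's notation, adding sex:thi.gi corresponds to the model

\[ wgt = \beta_0 + \beta_{male}1_{male} + \beta_{che.gi}che.gi + \beta_{thi.gi}thi.gi + \beta_{kne.gi}kne.gi + \beta_{int}1_{male}thi.gi + error \]

so the slope of thi.gi is \(\beta_{thi.gi}\) for females and \(\beta_{thi.gi} + \beta_{int}\) for males; the interaction term allows the two groups to have different slopes.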

  1. Add thi.gi:sex to your model and fit the model again. Is the model with the interaction term an improvement on the model without?

Looking at the relationship between weight and thigh girth in the scatterplot, the model might benefit from an interaction term. Let's compare the models with and without the interaction term:

mod3 = lm(wgt ~ sex + che.gi + thi.gi + kne.gi + sex:thi.gi, data=d)

# compare R-squared's
summary(mod3)

Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi + sex:thi.gi, 
    data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4106  -2.4585  -0.2282   2.2415  12.0997 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -79.88735    3.19529 -25.002  < 2e-16 ***
sex1         -3.38022    4.30838  -0.785   0.4331    
che.gi        0.67014    0.03245  20.650  < 2e-16 ***
thi.gi        0.56871    0.06860   8.290 1.05e-15 ***
kne.gi        1.42619    0.09990  14.277  < 2e-16 ***
sex1:thi.gi   0.15143    0.07529   2.011   0.0448 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.717 on 501 degrees of freedom
Multiple R-squared:  0.9232,    Adjusted R-squared:  0.9224 
F-statistic:  1205 on 5 and 501 DF,  p-value: < 2.2e-16
summary(mod1)

Call:
lm(formula = wgt ~ sex + che.gi + thi.gi + kne.gi, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.345  -2.515  -0.218   2.201  12.138 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -83.61218    2.61177 -32.014  < 2e-16 ***
sex1          5.19711    0.61496   8.451 3.14e-16 ***
che.gi        0.67463    0.03247  20.774  < 2e-16 ***
thi.gi        0.63959    0.05904  10.834  < 2e-16 ***
kne.gi        1.40590    0.09969  14.103  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.728 on 502 degrees of freedom
Multiple R-squared:  0.9226,    Adjusted R-squared:  0.922 
F-statistic:  1496 on 4 and 502 DF,  p-value: < 2.2e-16

Adjusted R-squared values are about the same for the two models.

# Look at the p-value, which is on the borderline of .05
anova(mod1, mod3)

The p-value is right on the borderline of 5%, so the test is inconclusive about whether the coefficient of the interaction term is zero.
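
The anova() call above performs a partial F-test for the nested models. With \(RSS_1\), \(df_1\) the residual sum of squares and residual degrees of freedom of the smaller model, and \(RSS_2\), \(df_2\) those of the larger model (here \(df_1 = 502\) and \(df_2 = 501\)), the test statistic is

\[ F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2} \]

which is compared against an \(F_{df_1 - df_2,\; df_2}\) distribution.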

Let’s compare the BIC values:

# Compare BICs
extractAIC(mod1, k=log(dim(d)[1]))
[1]    5.000 1360.404
extractAIC(mod3, k=log(dim(d)[1])) # This actually increases BIC
[1]    6.000 1362.556

Adding the interaction term actually increases the BIC value. Thus, for simplicity we favor model 1.

  1. What is the model matrix of model 1? What is the mathematical model for model 1?
head(model.matrix(mod1))
  (Intercept) sex1 che.gi thi.gi kne.gi
1           1    1   89.5   51.5   34.5
2           1    1   97.0   51.5   36.5
3           1    1   97.5   57.3   37.0
4           1    1   97.0   53.0   37.0
5           1    1   97.5   55.4   37.7
6           1    1   99.9   57.5   36.6
colnames(model.matrix(mod1))
[1] "(Intercept)" "sex1"        "che.gi"      "thi.gi"     
[5] "kne.gi"     

The mathematical model for model 1 is

\[ Weight = \beta_0 + \beta_{male}1_{male} + \beta_{che.gi}che.gi + \beta_{thi.gi}thi.gi + \beta_{kne.gi}kne.gi + error \]
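
As a check on the fitted equation, we can plug hypothetical measurements into the coefficients reported in summary(mod1) above (the girths 100, 55, and 37 below are made-up values for a male, not rows from bdims):

```r
# Coefficient estimates copied from summary(mod1)
b0 <- -83.61218; b_sex1 <- 5.19711
b_che <- 0.67463; b_thi <- 0.63959; b_kne <- 1.40590

# Predicted weight (kg) for a male with che.gi = 100, thi.gi = 55, kne.gi = 37
wgt_hat <- b0 + b_sex1 + b_che * 100 + b_thi * 55 + b_kne * 37
wgt_hat  # about 76.2
```

The same prediction can be obtained with predict(mod1, newdata = ...) once mod1 is fitted.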

Exercises

Please fill in the missing graphic titles and labels yourself.

Question 1

Now it is your turn! Use the dataset below and build a linear model with the variables in the dataset to predict the height of a person. Choose your champion model based on the BIC, adjusted \(R^2\), and Mallows' Cp values (we will not use cross-validation here). Look at the residual plots of your champion model to see whether any model assumptions are violated. Consider adding possible interaction terms to improve your model.

Note: You can turn off the warnings produced by ggpairs() by setting message = F, warning = F for the code chunk.

b = bdims[,c("hgt", "sex", "age", "bii.di", "bit.di", "che.de", "kne.di", "ank.di",
             "wai.gi", "nav.gi", "hip.gi", "thi.gi")]

dim(b)
[1] 507  12
str(b)
'data.frame':   507 obs. of  12 variables:
 $ hgt   : num  174 175 194 186 187 ...
 $ sex   : int  1 1 1 1 1 1 1 1 1 1 ...
 $ age   : int  21 23 28 23 22 21 26 27 23 21 ...
 $ bii.di: num  26 28.5 28.2 29.9 29.9 27 30 29.8 26.5 28 ...
 $ bit.di: num  31.5 33.5 33.3 34 34 31.5 34 33.2 32.1 34 ...
 $ che.de: num  17.7 16.9 20.9 18.4 21.5 19.6 21.9 21.8 15.5 22.5 ...
 $ kne.di: num  18.8 20.6 19.7 20.9 20.7 18.8 20.8 21 18.9 21.1 ...
 $ ank.di: num  14.1 15.1 14.1 15 14.9 13.9 15.6 14.6 13.2 15 ...
 $ wai.gi: num  71.5 79 83.2 77.8 80 82.5 82 76.8 68.5 77.5 ...
 $ nav.gi: num  74.5 86.5 82.9 78.8 82.5 80.1 84 80.5 69 81.5 ...
 $ hip.gi: num  93.5 94.8 95 94 98.5 95.3 101 98 89.5 99.8 ...
 $ thi.gi: num  51.5 51.5 57.3 53 55.4 57.5 60.9 56 50 59.8 ...

b$sex = as.factor(b$sex)

# There are too many variables for plotting them all
# on the same graph, so we will make two graphs

ggpairs(b[,c(2, 3:7, 1)], aes(colour = sex, alpha = 0.4), 
        upper = list(continuous = wrap("cor", size = 2.4)),
        lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) + 
  theme(axis.text.x = element_text(angle = 45, size = 6))


ggpairs(b[,c(2, 8:12, 1)], aes(colour = sex, alpha = 0.4), 
        upper = list(continuous = wrap("cor", size = 2.4)),
        lower = list(continuous = wrap("points", alpha = 0.3, size=0.1))) + 
  theme(axis.text.x = element_text(angle = 45, size = 6))

g <- regsubsets(hgt~., data=b, really.big=F, method='exhaustive', nvmax=12)

plot(summary(g)$bic, 
     main = 'BIC for best model for each given number of predictors',
     xlab = 'Given number of predictors', 
     ylab = 'BIC')


plot(summary(g)$adjr2, main = 'Adjusted R-square for best model \nfor each given number of predictors', xlab = 'Given number of predictors', ylab = 'Adjusted R-square') # adjusted R^2 chooses a different model


plot(summary(g)$cp, main = 'Mallows Cp for best model for each given number of predictors', xlab = 'Given number of predictors', ylab = 'Mallows Cp')
abline(a = 1, b = 1, lty = "dashed")


coef(g, id = 8)
(Intercept)        sex1         age      bii.di      bit.di 
100.3259546   8.2318518  -0.1607032   0.6926682   1.0475037 
     che.de      ank.di      wai.gi      thi.gi 
  0.5658327   1.9707962  -0.1102385  -0.1905396 
coef(g, id = 9)
(Intercept)        sex1         age      bii.di      bit.di 
 98.9916772   8.8607781  -0.1649853   0.6965493   0.8197194 
     che.de      ank.di      wai.gi      hip.gi      thi.gi 
  0.5696904   1.9328071  -0.1763574   0.2492287  -0.3702857 

mod1 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + thi.gi, data=b)
summary(mod1)

Call:
lm(formula = hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + 
    wai.gi + thi.gi, data = b)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.1582  -4.0421  -0.0961   4.0007  15.5024 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 100.32595    4.75057  21.119  < 2e-16 ***
sex1          8.23185    1.00712   8.174 2.49e-15 ***
age          -0.16070    0.03125  -5.142 3.92e-07 ***
bii.di        0.69267    0.15967   4.338 1.74e-05 ***
bit.di        1.04750    0.19986   5.241 2.36e-07 ***
che.de        0.56583    0.17507   3.232  0.00131 ** 
ank.di        1.97080    0.32074   6.145 1.64e-09 ***
wai.gi       -0.11024    0.05175  -2.130  0.03364 *  
thi.gi       -0.19054    0.08670  -2.198  0.02843 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.678 on 498 degrees of freedom
Multiple R-squared:  0.6415,    Adjusted R-squared:  0.6357 
F-statistic: 111.4 on 8 and 498 DF,  p-value: < 2.2e-16

mod2 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + hip.gi + thi.gi, data=b)
summary(mod2)

Call:
lm(formula = hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + 
    wai.gi + hip.gi + thi.gi, data = b)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.6949  -4.2401  -0.0461   4.0656  16.3823 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.99168    4.77182  20.745  < 2e-16 ***
sex1         8.86078    1.04374   8.489 2.41e-16 ***
age         -0.16499    0.03120  -5.288 1.85e-07 ***
bii.di       0.69655    0.15908   4.379 1.46e-05 ***
bit.di       0.81972    0.22472   3.648 0.000292 ***
che.de       0.56969    0.17442   3.266 0.001165 ** 
ank.di       1.93281    0.32000   6.040 3.02e-09 ***
wai.gi      -0.17636    0.05977  -2.951 0.003320 ** 
hip.gi       0.24923    0.11399   2.186 0.029256 *  
thi.gi      -0.37029    0.11924  -3.105 0.002010 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.656 on 497 degrees of freedom
Multiple R-squared:  0.6449,    Adjusted R-squared:  0.6385 
F-statistic: 100.3 on 9 and 497 DF,  p-value: < 2.2e-16

round(vcov(mod1), 4)
            (Intercept)    sex1     age  bii.di  bit.di
(Intercept)     22.5679  0.7052 -0.0018 -0.0640 -0.3125
sex1             0.7052  1.0143  0.0100  0.0313 -0.0073
age             -0.0018  0.0100  0.0010 -0.0002 -0.0008
bii.di          -0.0640  0.0313 -0.0002  0.0255 -0.0155
bit.di          -0.3125 -0.0073 -0.0008 -0.0155  0.0399
che.de          -0.0838 -0.0310 -0.0006 -0.0005  0.0014
ank.di          -0.5838 -0.1555 -0.0008 -0.0066 -0.0118
wai.gi           0.0502 -0.0272 -0.0005 -0.0010 -0.0009
thi.gi          -0.0910  0.0476  0.0011  0.0003 -0.0053
             che.de  ank.di  wai.gi  thi.gi
(Intercept) -0.0838 -0.5838  0.0502 -0.0910
sex1        -0.0310 -0.1555 -0.0272  0.0476
age         -0.0006 -0.0008 -0.0005  0.0011
bii.di      -0.0005 -0.0066 -0.0010  0.0003
bit.di       0.0014 -0.0118 -0.0009 -0.0053
che.de       0.0307 -0.0049 -0.0038 -0.0025
ank.di      -0.0049  0.1029  0.0007 -0.0025
wai.gi      -0.0038  0.0007  0.0027 -0.0019
thi.gi      -0.0025 -0.0025 -0.0019  0.0075

round(vcov(mod2), 4) 
            (Intercept)    sex1     age  bii.di  bit.di
(Intercept)     22.7703  0.5244 -0.0006 -0.0646 -0.2466
sex1             0.5244  1.0894  0.0094  0.0316 -0.0372
age             -0.0006  0.0094  0.0010 -0.0002 -0.0006
bii.di          -0.0646  0.0316 -0.0002  0.0253 -0.0155
bit.di          -0.2466 -0.0372 -0.0006 -0.0155  0.0505
che.de          -0.0842 -0.0302 -0.0006 -0.0005  0.0012
ank.di          -0.5688 -0.1594 -0.0008 -0.0066 -0.0099
wai.gi           0.0683 -0.0357 -0.0005 -0.0010  0.0023
hip.gi          -0.0696  0.0328 -0.0002  0.0002 -0.0119
thi.gi          -0.0401  0.0236  0.0012  0.0001  0.0033
             che.de  ank.di  wai.gi  hip.gi  thi.gi
(Intercept) -0.0842 -0.5688  0.0683 -0.0696 -0.0401
sex1        -0.0302 -0.1594 -0.0357  0.0328  0.0236
age         -0.0006 -0.0008 -0.0005 -0.0002  0.0012
bii.di      -0.0005 -0.0066 -0.0010  0.0002  0.0001
bit.di       0.0012 -0.0099  0.0023 -0.0119  0.0033
che.de       0.0304 -0.0049 -0.0038  0.0002 -0.0026
ank.di      -0.0049  0.1024  0.0012 -0.0020 -0.0011
wai.gi      -0.0038  0.0012  0.0036 -0.0034  0.0006
hip.gi       0.0002 -0.0020 -0.0034  0.0130 -0.0094
thi.gi      -0.0026 -0.0011  0.0006 -0.0094  0.0142

vcov(mod2)["hip.gi", "thi.gi"]/sqrt(vcov(mod2)["hip.gi", "hip.gi"])/sqrt(vcov(mod2)["thi.gi", "thi.gi"])
[1] -0.68945
# -0.68945 is high enough to attract attention

plot(mod1, which=1:2)


hist(mod1$residuals, freq=F, breaks=30)



mod3 = lm(hgt ~ sex + age + bii.di + bit.di + che.de + ank.di + wai.gi + thi.gi + sex:thi.gi, data=b)